[IR] Add llvm.masked.load.first.fault intrinsic (#156470)
huntergr-arm wants to merge 7 commits into llvm:main
Conversation
david-arm left a comment:
It might be a good idea to split this patch up into two parts:
- A patch to introduce the LangRef and intrinsic along with the scalarisation code (on by default for all targets).
- A follow-on patch to add the TTI hook and the codegen.
What do you think?
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s

define { <4 x i32>, <4 x i1> } @speculative_load_v4i32_neon(ptr %p, <4 x i1> %mask) {
It might be worth having a separate test file for fixed-width perhaps? For fixed-width I think we want to have different RUN lines for normal SVE and streaming mode. For the latter we can't use NEON and so will have to fall back on SVE.
Done. I've only included fixed vector tests for now as we won't scalarize scalable vectors.
Force-pushed from edd26c4 to b76e940.
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes: In order to support loading from addresses which may not be valid at runtime without generating faults, we introduce a speculative load intrinsic. This PR contains both target-independent and AArch64-specific codegen.

Full diff: https://github.com/llvm/llvm-project/pull/156470.diff

4 Files Affected:
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 1c6823be44dcb..92ea6554c4aba 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -27098,6 +27098,65 @@ The '``llvm.masked.compressstore``' intrinsic is designed for compressing data i
Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
+.. _int_mspecload:
+
+'``llvm.masked.speculative.load.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic. The loaded data is a vector of any integer, floating-point or pointer data type.
+
+::
+
+ declare { <16 x float>, <16 x i1> } @llvm.masked.speculative.load.v16f32.p0(ptr <ptr>, i32 <alignment>, <16 x i1> <mask>)
+ declare { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr <ptr>, i32 <alignment>, <2 x i1> <mask>)
+ ;; The data is a vector of pointers
+ declare { <8 x ptr>, <8 x i1> } @llvm.masked.speculative.load.v8p0.p0(ptr <ptr>, i32 <alignment>, <8 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Reads a vector from memory according to the provided mask, suppressing faults
+for any lane beyond the first. The mask holds a bit for each vector lane, and
+is used to prevent memory accesses to the masked-off lanes. Inactive lanes will
+be zero in the result vector.
+
+Returns the loaded data and a mask indicating which lanes are valid, which may
+not be the same as the input mask depending on whether the processor encountered
+a reason to avoid loading from that address.
+
+Arguments:
+""""""""""
+
+The first argument is the base pointer for the load. The second argument is the
+alignment of the source location. It must be a power of two constant integer
+value. The third argument, mask, is a vector of boolean values with the same
+number of elements as the return type.
+
+Semantics:
+""""""""""
+
+The '``llvm.masked.speculative.load``' intrinsic is similar to the
+'``llvm.masked.load``' intrinsic, in that it conditionally loads values from
+memory into a vector based on a mask. However, it allows loading from addresses
+which may not be entirely safe. If the memory corresponding to the first element
+of the vector is inaccessible, then a fault will be raised as normal. For all
+subsequent lanes faults will be suppressed and the corresponding bit in the
+output mask will be marked inactive. The remaining elements in the output mask
+after a suppressed fault will also be marked inactive. Elements with active bits
+in the input mask will be poison values if the corresponding bit is inactive in
+the output mask.
+
+Reasons for marking output elements inactive are processor dependent; it may be
+a genuine fault, e.g. if the range of the data being loaded spans a page
+boundary and the page at the higher address is not mapped. But a given
+processor may also mark elements as inactive for other reasons, such as a cache
+miss. Code using this intrinsic must take this into account and not assume that
+inactive lanes signal the end of accessible memory. If more data should be
+loaded based on the semantics of the user code, then the base pointer should be
+advanced to the address of the first inactive element and a new speculative load
+attempted.
Memory Use Markers
------------------
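The contract described in the semantics section above — a real fault only for the first element, with suppressed faults turning into an inactive tail of the output mask — can be modeled in plain C++. This is a sketch of the semantics only, not LLVM code; the `accessible` parameter (how many elements can be read before a fault) and all names are illustrative:

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>

// Result of a modeled first-fault load: the data vector and the mask of
// lanes that were actually loaded. Lanes dropped by a suppressed fault are
// poison in the real IR; here they keep their zero-initialized value.
template <typename T, std::size_t N>
struct FFResult {
  std::array<T, N> data{};
  std::array<bool, N> out_mask{};
};

template <typename T, std::size_t N>
FFResult<T, N> first_fault_load(const T *p, std::size_t accessible,
                                const std::array<bool, N> &mask) {
  FFResult<T, N> r;
  for (std::size_t i = 0; i < N; ++i) {
    if (!mask[i])
      continue; // masked-off lane: zero data, inactive in the output mask
    if (i >= accessible) {
      if (i == 0)
        throw std::out_of_range("fault on first lane"); // real fault
      break; // suppressed fault: this and all later lanes are inactive
    }
    r.data[i] = p[i];
    r.out_mask[i] = true;
  }
  return r;
}
```

Note that, per the LangRef wording, a processor may also drop lanes for reasons other than a genuine fault (e.g. a cache miss); the model above only shows the fault case.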
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 4d59ee8676b9e..74fed62fcc1dd 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2505,6 +2505,14 @@ def int_masked_compressstore:
[IntrWriteMem, IntrArgMemOnly,
NoCapture<ArgIndex<1>>]>;
+def int_masked_speculative_load
+ : DefaultAttrsIntrinsic<[llvm_anyvector_ty,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [llvm_anyptr_ty, llvm_i32_ty,
+ LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+ [IntrReadMem, IntrArgMemOnly, ImmArg<ArgIndex<1>>,
+ NoCapture<ArgIndex<0>>]>;
+
def int_experimental_vector_compress:
DefaultAttrsIntrinsic<[llvm_anyvector_ty],
[LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
diff --git a/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp b/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
index 146e7d1047dd0..531d79f89f7cd 100644
--- a/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
+++ b/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
@@ -1031,6 +1031,68 @@ static void scalarizeMaskedVectorHistogram(const DataLayout &DL, CallInst *CI,
ModifiedDT = true;
}
+static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
+ DomTreeUpdater *DTU,
+ bool &ModifiedDT) {
+ // For a target without speculative/first-faulting load support, we can't
+ // actually scalarize accesses for all lanes. However, lanes beyond the
+ // first may be considered inactive due to reasons beyond a fault, so for
+ // generic 'scalarization' we can just load the first lane (if the
+ // corresponding input mask bit is active), then mark all other lanes as
+ // inactive in the output mask and embed the first lane into a vector of
+ // poison.
+ Value *Ptr = CI->getArgOperand(0);
+ Value *Align = CI->getArgOperand(1);
+ Value *Mask = CI->getArgOperand(2);
+ StructType *RetTy = cast<StructType>(CI->getType());
+ VectorType *DataTy = cast<VectorType>(RetTy->getElementType(0));
+ VectorType *MaskTy = cast<VectorType>(RetTy->getElementType(1));
+ Type *ScalarTy = DataTy->getScalarType();
+
+ MaybeAlign AlignVal = cast<ConstantInt>(Align)->getMaybeAlignValue();
+
+ IRBuilder<> Builder(CI->getContext());
+ BasicBlock *IfBlock = CI->getParent();
+ Builder.SetInsertPoint(CI);
+ Builder.SetCurrentDebugLocation(CI->getDebugLoc());
+ Value *EmptyMask = Constant::getNullValue(MaskTy);
+ Value *PoisonData = PoisonValue::get(DataTy);
+
+ // FIXME: If the mask is a constant, we can skip the extract.
+ Value *FirstActive =
+ Builder.CreateExtractElement(Mask, 0ul, Twine("first.active"));
+ Instruction *ThenTerm =
+ SplitBlockAndInsertIfThen(FirstActive, CI,
+ /*Unreachable=*/false,
+ /*BranchWeights=*/nullptr, DTU);
+
+ BasicBlock *ThenBlock = ThenTerm->getParent();
+ ThenBlock->setName("speculative.load.first.lane");
+ Builder.SetInsertPoint(ThenBlock->getTerminator());
+ LoadInst *Load = Builder.CreateAlignedLoad(ScalarTy, Ptr, AlignVal);
+ Value *OneLaneData = Builder.CreateInsertElement(PoisonData, Load, 0ul);
+ Value *OneLaneMask = Builder.CreateInsertElement(
+ EmptyMask, Constant::getAllOnesValue(MaskTy->getElementType()), 0ul);
+
+ Builder.SetInsertPoint(CI);
+ PHINode *ResData = Builder.CreatePHI(DataTy, 2);
+ ResData->addIncoming(PoisonData, IfBlock);
+ ResData->addIncoming(OneLaneData, ThenBlock);
+ PHINode *ResMask = Builder.CreatePHI(MaskTy, 2);
+ ResMask->addIncoming(EmptyMask, IfBlock);
+ ResMask->addIncoming(OneLaneMask, ThenBlock);
+
+ Value *Result = PoisonValue::get(RetTy);
+ Result = Builder.CreateInsertValue(Result, ResData, 0ul);
+ Result = Builder.CreateInsertValue(Result, ResMask, 1ul);
+ if (CI->hasName())
+ Result->setName(CI->getName() + ".first.lane.only");
+ CI->getParent()->setName("speculative.result");
+ CI->replaceAllUsesWith(Result);
+ CI->eraseFromParent();
+ ModifiedDT = true;
+}
+
static bool runImpl(Function &F, const TargetTransformInfo &TTI,
DominatorTree *DT) {
std::optional<DomTreeUpdater> DTU;
@@ -1170,8 +1232,12 @@ static bool optimizeCallInst(CallInst *CI, bool &ModifiedDT,
scalarizeMaskedCompressStore(DL, HasBranchDivergence, CI, DTU,
ModifiedDT);
return true;
+ case Intrinsic::masked_speculative_load: {
+ scalarizeMaskedSpeculativeLoad(DL, CI, DTU, ModifiedDT);
+ return true;
}
}
+ }
return false;
}
diff --git a/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll b/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll
new file mode 100644
index 0000000000000..e8facb8a5683b
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll
@@ -0,0 +1,79 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s --check-prefixes=NEON
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s --check-prefixes=SVE
+
+define { <4 x i32>, <4 x i1> } @speculative_load_v4i32(ptr %p, <4 x i1> %mask) {
+; NEON-LABEL: speculative_load_v4i32:
+; NEON: // %bb.0:
+; NEON-NEXT: // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT: umov w8, v0.h[0]
+; NEON-NEXT: tbz w8, #0, .LBB0_2
+; NEON-NEXT: // %bb.1: // %speculative.load.first.lane
+; NEON-NEXT: adrp x8, .LCPI0_0
+; NEON-NEXT: ldr s0, [x0]
+; NEON-NEXT: ldr d1, [x8, :lo12:.LCPI0_0]
+; NEON-NEXT: // kill: def $d1 killed $d1 killed $q1
+; NEON-NEXT: ret
+; NEON-NEXT: .LBB0_2:
+; NEON-NEXT: movi v1.2d, #0000000000000000
+; NEON-NEXT: // implicit-def: $q0
+; NEON-NEXT: // kill: def $d1 killed $d1 killed $q1
+; NEON-NEXT: ret
+;
+; SVE-LABEL: speculative_load_v4i32:
+; SVE: // %bb.0:
+; SVE-NEXT: // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT: umov w8, v0.h[0]
+; SVE-NEXT: tbz w8, #0, .LBB0_2
+; SVE-NEXT: // %bb.1: // %speculative.load.first.lane
+; SVE-NEXT: adrp x8, .LCPI0_0
+; SVE-NEXT: ldr s0, [x0]
+; SVE-NEXT: ldr d1, [x8, :lo12:.LCPI0_0]
+; SVE-NEXT: // kill: def $d1 killed $d1 killed $q1
+; SVE-NEXT: ret
+; SVE-NEXT: .LBB0_2:
+; SVE-NEXT: movi v1.2d, #0000000000000000
+; SVE-NEXT: // implicit-def: $q0
+; SVE-NEXT: // kill: def $d1 killed $d1 killed $q1
+; SVE-NEXT: ret
+ %res = call { <4 x i32>, <4 x i1> } @llvm.masked.speculative.load.v4i32.p0(ptr %p, i32 16, <4 x i1> %mask)
+ ret { <4 x i32>, <4 x i1> } %res
+}
+
+;; FIXME: If we know the input mask is all-true and the vector is fully aligned,
+;; we should be able to use a normal NEON load here.
+define { <2 x double>, <2 x i1> } @speculative_load_v2f64_all_true_fully_aligned(ptr %p) {
+; NEON-LABEL: speculative_load_v2f64_all_true_fully_aligned:
+; NEON: // %bb.0: // %speculative.load.first.lane
+; NEON-NEXT: adrp x8, .LCPI1_0
+; NEON-NEXT: ldr d0, [x0]
+; NEON-NEXT: ldr d1, [x8, :lo12:.LCPI1_0]
+; NEON-NEXT: ret
+;
+; SVE-LABEL: speculative_load_v2f64_all_true_fully_aligned:
+; SVE: // %bb.0: // %speculative.load.first.lane
+; SVE-NEXT: ldr d0, [x0]
+; SVE-NEXT: index z1.s, #1, #-1
+; SVE-NEXT: // kill: def $d1 killed $d1 killed $z1
+; SVE-NEXT: ret
+ %res = call { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr %p, i32 16, <2 x i1> <i1 true, i1 true>)
+ ret { <2 x double>, <2 x i1> } %res
+}
+
+define { <2 x double>, <2 x i1> } @speculative_load_v2f64_all_true_partially_aligned(ptr %p) {
+; NEON-LABEL: speculative_load_v2f64_all_true_partially_aligned:
+; NEON: // %bb.0: // %speculative.load.first.lane
+; NEON-NEXT: adrp x8, .LCPI2_0
+; NEON-NEXT: ldr d0, [x0]
+; NEON-NEXT: ldr d1, [x8, :lo12:.LCPI2_0]
+; NEON-NEXT: ret
+;
+; SVE-LABEL: speculative_load_v2f64_all_true_partially_aligned:
+; SVE: // %bb.0: // %speculative.load.first.lane
+; SVE-NEXT: ldr d0, [x0]
+; SVE-NEXT: index z1.s, #1, #-1
+; SVE-NEXT: // kill: def $d1 killed $d1 killed $z1
+; SVE-NEXT: ret
+ %res = call { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr %p, i32 8, <2 x i1> <i1 true, i1 true>)
+ ret { <2 x double>, <2 x i1> } %res
+}
Yep, that's fine. I'll use this PR for the intrinsic and raise the AArch64 PR after this lands.
static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
                                           DomTreeUpdater *DTU,
                                           bool &ModifiedDT) {
  // For a target without speculative/first-faulting load support, we can't
I think it'd be good to have an IR test with -passes=scalarize-masked-mem-intrin, similar to llvm/test/Transforms/ScalarizeMaskedMemIntrin/AArch64/streaming-compatible-expand-masked-gather-scatter.ll.
  ModifiedDT = true;
}

static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
Other scalarize* functions have a doc comment with an example, I think that'd be worth adding here too.
llvm/docs/LangRef.rst (outdated)

Syntax:
"""""""
This is an overloaded intrinsic. The loaded data is a vector of any integer, floating-point or pointer data type.
Value *EmptyMask = Constant::getNullValue(MaskTy);
Value *PoisonData = PoisonValue::get(DataTy);

// FIXME: If the mask is a constant, we can skip the extract.
Is it worth handling this here? -passes=instcombine,simplifycfg would do the same.
llvm/docs/LangRef.rst (outdated)

of the vector is inaccessible, then a fault will be raised as normal. For all
subsequent lanes faults will be suppressed and the corresponding bit in the
nit, suggested change:

- of the vector is inaccessible, then a fault will be raised as normal. For all
- subsequent lanes faults will be suppressed and the corresponding bit in the
+ of the vector is inaccessible, then a fault will be raised as normal. For all subsequent
+ lanes, if a fault occurs, it will be suppressed and the corresponding bit in the
llvm/docs/LangRef.rst (outdated)

the output mask.

Reasons for marking output elements inactive are processor dependent; it may be
a genuine fault, e.g. if the range of the data being loaded spans a page
Worth saying a reason for marking lanes inactive could also be lack of support for suppressing faults?
Force-pushed from 364c8c4 to 7a8355f.
Changed to just focus on first-faulting loads, since there's already a .vp EVL-based intrinsic for this and #179642 should cover the page-based speculation.
🪟 Windows x64 Test Results
✅ The build succeeded and all tests passed.
llvm/docs/LangRef.rst (outdated)

declare { <16 x float>, <16 x i1> } @llvm.masked.load.first.fault.v16f32.p0(ptr <ptr>, i32 <alignment>, <16 x i1> <mask>)
declare { <2 x double>, <2 x i1> } @llvm.masked.load.first.fault.v2f64.p0(ptr <ptr>, i32 <alignment>, <2 x i1> <mask>)
;; The data is a vector of pointers
declare { <8 x ptr>, <8 x i1> } @llvm.masked.load.first.fault.v8p0.p0(ptr <ptr>, i32 <alignment>, <8 x i1> <mask>)
Do we need a passthru argument similar to the generic masked loads? cc @paulwalker-arm
I'm not sure we want a passthru for other lanes, given that we will likely stop processing anything past a 'fault' when loading data to determine early-exit conditions in the current vector iteration.
The equivalent evl intrinsic (llvm.vp.load.ff) doesn't have a passthru either.
I agree, defining the other lanes as poison seems fine. If users want to set them to a known value they can use a select instruction to set them.
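The select-based approach amounts to blending the returned data with a passthru vector under the returned mask. A sketch in plain C++ (names are illustrative; in IR this would be a single `select` on the result mask):

```cpp
#include <array>
#include <cstddef>

// Emulate `select %out_mask, %data, %passthru`: lanes the load left invalid
// (poison in the real IR) are replaced by the corresponding passthru value.
template <typename T, std::size_t N>
std::array<T, N> apply_passthru(const std::array<T, N> &data,
                                const std::array<bool, N> &out_mask,
                                const std::array<T, N> &passthru) {
  std::array<T, N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = out_mask[i] ? data[i] : passthru[i];
  return r;
}
```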
fhahn left a comment:
IIUC the new intrinsic has the same semantics as llvm.vp.load.ff, modulo the extra EVL argument?
If so, it would be nice to unify them if possible; not sure if we could have an optional EVL argument, which defaults to the number of elements of the vector type if omitted?
If that's not possible, it would be good to try to keep the names consistent (load.ff vs load.first.fault) and possibly share the wording for the semantics definition somehow, possibly referencing vp.load.ff
Sure, I can use the `.ff` naming. It returns a mask instead of a new EVL, but that mask would be a 'prefix' mask, so it could be synthesized from the returned EVL. As far as the overlap in functionality goes, it's pretty much the same as llvm.masked.load vs. llvm.vp.load (or store, gather, scatter, etc.). Would operand bundles be the way to make an argument optional? At some point, Arm would probably also want to introduce a …
It would be possible to make it a variable-argument intrinsic, but that isn't nice, I think, unfortunately, as it would require spelling out the full function type for each call IIRC. Operand bundles may be a way, but I am not sure if they are super well supported everywhere. Maybe for the new intrinsic as a first step, it would be sufficient to share the description of the semantics, and link between them in the LangRef?
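The prefix-mask equivalence mentioned above is easy to state concretely: an EVL of k corresponds to a mask with exactly the first k bits set, so the two return conventions are interconvertible. A sketch (helper names are illustrative, not existing LLVM utilities):

```cpp
#include <cstddef>
#include <vector>

// Synthesize the prefix mask that an EVL of `evl` implies for an n-element
// vector: bits 0..evl-1 set, the rest clear.
std::vector<bool> mask_from_evl(std::size_t n, std::size_t evl) {
  std::vector<bool> mask(n, false);
  for (std::size_t i = 0; i < evl && i < n; ++i)
    mask[i] = true;
  return mask;
}

// Recover the EVL from a prefix mask by counting the leading set bits.
// Only meaningful if the mask really is a prefix mask.
std::size_t evl_from_mask(const std::vector<bool> &mask) {
  std::size_t evl = 0;
  while (evl < mask.size() && mask[evl])
    ++evl;
  return evl;
}
```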
Force-pushed from 1bc1d81 to 15eded6.
Renamed to use a `.ff` suffix.
In order to support loading from addresses which may not be valid at runtime without generating faults, we introduce the first fault load intrinsic. Loading with this intrinsic will only generate a fault for invalid accesses on the first element of the vector. Any subsequent fault will be suppressed and the corresponding data will be poison. This PR contains target-independent scalarization of the intrinsic so that generic codegen works.
Force-pushed from 15eded6 to bed61d9.
The '``llvm.masked.load.ff``' intrinsic is very similar to the
'``llvm.vp.load.ff``' intrinsic, with the differences being the lack of an EVL
parameter and the second returned value being a mask instead of an updated EVL
value.
Given that VP intrinsics are usually defined as a "layer above" masked intrinsics, should we instead modify the documentation llvm.vp.load.ff to mention it's similar to llvm.masked.load.ff instead?
If the processor suppresses a fault for any lane, then the returned mask will
indicate that lane and all subsequent lanes are inactive.
Does this mean that the mask is a prefix mask? I.e. after the first unset bit, every subsequent bit will be unset. I think we will need to depend on that property in the loop vectorizer. Does SVE guarantee that?
We tried to add expansion for vp.load.ff too in #154440 but I'm not sure if it's worth it.
The scalarization for the vp intrinsic works the same way as here where we rely on the fact that the intrinsic can arbitrarily reduce the number of non-faulted lanes to one.
But in that case I don't think it will ever be profitable to scalarize these intrinsics, because it means that a vectorized loop will always just be stepping through one element at a time. I.e. it's now a scalar loop.
I think we should just be returning an invalid cost if the target can't lower these.
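To make the profitability point concrete: a consumer loop advances by the number of valid lanes each iteration, so an implementation that only ever produces one valid lane turns the vector loop into an effectively scalar one. A plain C++ sketch (the `ff_load4` callback and all names are illustrative, not LLVM API):

```cpp
#include <array>
#include <cstddef>

// Count the valid prefix of an output mask; the consuming loop advances by
// this amount each iteration.
template <std::size_t N>
std::size_t valid_lanes(const std::array<bool, N> &out_mask) {
  std::size_t n = 0;
  while (n < N && out_mask[n])
    ++n;
  return n;
}

// Sum `len` ints using a caller-supplied 4-wide first-fault load; the loop
// advances by however many lanes the load reported as valid. If the load is
// scalarized and only ever produces one valid lane, this is a scalar loop.
template <typename FFLoad4>
long sum_with_ff(const int *p, std::size_t len, FFLoad4 ff_load4) {
  long sum = 0;
  std::size_t i = 0;
  while (i < len) {
    auto [data, mask] = ff_load4(p + i, len - i);
    std::size_t n = valid_lanes<4>(mask);
    if (n == 0)
      break; // lane 0 is never suppressed; this would be a real fault
    for (std::size_t j = 0; j < n && i + j < len; ++j)
      sum += data[j];
    i += n;
  }
  return sum;
}
```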